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In biomedical studies it is of substantial interest to develop risk 
prediction scores using high-dimensional data such as gene expression 
data for clinical endpoints that are subject to censoring. In the pres- 
ence of well-established clinical risk factors, investigators often prefer 
a procedure that also adjusts for these clinical variables. While accel- 
erated failure time (AFT) models are a useful tool for the analysis of 
censored outcome data, it assumes that covariate effects on the log- 
arithm of time-to-event are linear, which is often unrealistic in prac- 
tice. We propose to build risk prediction scores through regularized 
rank estimation in partly linear AFT models, where high-dimensional 
data such as gene expression data are modeled linearly and important 
clinical variables are modeled nonlinearly using penalized regression 
splines. We show through simulation studies that our model has bet- 
ter operating characteristics compared to several existing models. In 
particular, we show that there is a nonnegligible effect on prediction 
as well as feature selection when nonlinear clinical effects are mis- 
specified as linear. This work is motivated by a recent prostate cancer 
study, where investigators collected gene expression data along with 
established prognostic clinical variables and the primary endpoint is 
time to prostate cancer recurrence. We analyzed the prostate cancer 
data and evaluated prediction performance of several models based 
on the extended c statistic for censored data, showing that (1) the re- 
lationship between the clinical variable, prostate specific antigen, and 
the prostate cancer recurrence is likely nonlinear, that is, the time to 
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recurrence decreases as PSA increases and it starts to level off when 
PSA becomes greater than 11; (2) correct specification of this nonlin- 
ear effect improves performance in prediction and feature selection; 
and (3) addition of gene expression data does not seem to further 
improve the performance of the resultant risk prediction scores. 

1. Introduction. In biomedical research it is of substantial interest to 
build prediction scores for risk of a disease using high-dimensional biomarker 
data such as gene expression data for clinical endpoints subject to censor- 
ing, for example, time to the development or recurrence of a disease. This 
process typically involves a feature selection step, which identifies impor- 
tant biomarkers that are predictive of the risk. When some clinical variables 
have been established as the risk factors of a disease, it is preferred to use 
a feature selection procedure that also accounts for these clinical variables. 
Using observed data with censored outcomes, our goal is to build risk pre- 
diction scores using high-dimensional data through feature selection while 
simultaneously adjusting for effects of clinical variables that are potentially 
nonlinear. 

1.1. A prostate cancer study. This article is motivated by a prostate 
cancer study. An important challenge in prostate cancer research is to de- 
velop effective predictors of future tumor recurrence following surgery in or- 
der to determine whether immediate adjuvant therapy is warranted. Thus, 
biomarkers that could predict the likelihood of success for surgical therapies 
would be of great clinical significance. In this study, each patient underwent 
radical prostatectomy following a diagnosis of prostate cancer, and their rad- 
ical prostatectomy specimens were collected immediately after the surgery 
and subsequently formalin-fixed and paraffin-embedded (FFPE). More re- 
cently, the investigators isolated RNA samples from these specimens and 
performed DASL (cDNA-mediated Annealing, Selection, extension and Lig- 
ation) expression profiling on these RNA samples using a custom-designed 
panel of 1,536 probes for 522 prostate cancer relevant genes. The DASL 
assay is a novel expression profiling platform based upon massively multi- 
plexed real-time polymerase chain reaction applied in a microarray format, 
and, more importantly, it allows quantitative analysis of RNA from FFPE 
samples, whereas traditional microarrays do not [Bibikova et al. (2004); 
Abramovitz et al. (2008)]. In addition, important clinical variables were also 
collected, two of which, prostate specific antigen (PSA) and total gleason 
score, are known to be associated with prostate cancer risk and prognosis 
and are of particular interest. The primary clinical endpoint in this study is 
time to prostate cancer recurrence. The research questions of interest include 
the following: (1) identifying important probes that are predictive of the re- 
currence of prostate cancer after adjusting for important clinical variables; 
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(2) constructing and evaluating risk prediction scores; and (3) determining 
whether the inclusion of the gene expression data improves the prediction 
performance. It was also suspected that PSA may have a nonlinear effect 
on the clinical endpoint. In this article we will develop and apply a new 
statistical model, which allows us to answer these questions. 

1.2. Feature selection and prediction in AFT. The accelerated failure 
time (AFT) model is an important tool for the analysis of censored out- 
come data [Cox and Oakes (1984); Kalbfleisch and Prentice (2002)]. Com- 
pared to the more popular proportional hazard (PH) model [Cox (1972)], 
the AFT model is, as suggested by Sir David Cox [Reid (1994)], "in many 
ways more appealing because of its quite direct physical interpretation," 
especially when the response variable is not related to survival time. Fur- 
thermore, when prediction is of primary interest, the AFT model is arguably 
more attractive, since it models the mean of the log-transformed outcome 
variable, whereas the Cox PH model estimates the hazard functions. 

Classic AFT models assume that the covariate effects on the logarithm 
of the time-to-event are linear, in which case one could use standard rank- 
based techniques for estimation and inference [Tsiatis (1990); Ying (1993); 
Jin et al. (2003)] and perform a lasso-type [Tibshirani (1996)] variable se- 
lection [Johnson (2008); Cai, Huang and Tian (2009)]. Regarding existing 
variable selection and prediction procedures, there are two unsatisfying prod- 
ucts. First, the linearity assumption may not hold in real data. For example, 
Kattan (2003a) showed that relaxing the linearity assumption of the Cox PH 
model improved predictive accuracy in the setting of predicting prostate can- 
cer recurrence with low-dimensional data. Second, an unsupervised imple- 
mentation of the regularized variable selection procedure can inadvertently 
remove clinical variables that are known to be scientifically relevant and 
can be measured easily in practice. We will address both concerns in our 
extensions of AFT models. 

1.3. Partly linear models. It has been well established that linear regres- 
sion models are insufficient in many applications and it is more desirable to 
allow for more general covariate effects. Nonlinear modeling of covariate ef- 
fects is less restrictive than the linear modeling approach and thus is less 
likely to distort the underlying relationship between an outcome and co- 
variates. However, new challenges arise when including nonlinear covariate 
effects in regression models. In particular, nonparametric regression meth- 
ods encounter the so-called "curse of dimensionality" problem, that is, the 
convergence rate of the resulting estimator decreases as the dimension of the 
covariates increases [Stone (1980)], which is further exacerbated when the 
dimension of the covariates is high. The partly linear model of Engle et al. 
(1986); Hardle, Liang and Gao (2000); Ruppert, Wand and Carroll (2003) 
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provides a useful compromise to model the effect of some covariates nonlin- 
ear ly and the rest linearly. Specifically, for the ith subject, let Tj be a uni- 
variate endpoint of interest for the ith subject, and Zj = (zj; , • • • , z\ ) T 

(d x 1) and Xj = (X^\ . . . , A^) T (q x 1) denote high-dimensional features 
of interest (say, gene expression levels) and established clinical variables, 
respectively. Then one partly linear model of interest is 

(1) T i = (f ) (X t )+ti T Z t + e i , 

where i9 = ($i, . . . , $d) T is a parameter vector of interest, <j) is an unspecified 
function, and the errors (gj) are independently and identically distributed 
(i.i.d.) and follow an arbitrary distribution function F e . Special cases of 
this model have been used in varied applications across many disciplines 
including econometrics, engineering, biostatistics and epidemiology [Hardle, 
Liang and Gao (2000)]. In this article we consider Model (1) for subject 
to right-censoring, and, hence, the observed data are {(Tj,<5j,Zj,Xj)}™ =1 , 
where Tj = min(Tj, Cj), <5j = I(Ti < Cj), and Cj is a random censoring event. 
We note that Tj is the log-transformed survival time in survival analysis, 
and we refer to Model (1) as partly linear AFT models. 

In the absence of censoring, the nonparametric function <f> in Model (1) 
can be estimated using kernel methods [Hardle, Liang and Gao (2000), refer- 
ences therein] and smoothing spline methods [Engle et al. (1986); Heckman 
(1986)]. For partly linear AFT models, one can extend the basic weighting 
scheme of Koul, Susarla and van Ryzin (1981), where one treats censoring 
like other missing data problems [Tsiatis (2006)] and inversely weights the 
uncensored observations by the probability of being uncensored, that is, so- 
called inverse-probability weighted (IPW) estimators. A close cousin to the 
IPW methodology is censoring unbiased transformations [Fan and Gijbels 
(1996), Chapter 5 and references therein], which effectively replaces a cen- 
sored outcome with a suitable surrogate before complete-data estimation 
procedures are applied. Both IPW kernel-type estimators and censoring un- 
biased transformations in the partly linear model have been studied for AFT 
models [Liang and Zhou (1998); Wang and Li (2002)]. Since both aforemen- 
tioned approaches make stronger assumptions than rank estimation of AFT 
models [Cai, Huang and Tian (2009)], we focus on extending rank estimation 
to meet our needs. 

We here consider a general penalized loss function for partly linear AFT 
models 

(2) min +<yj(<f>), 

where C n is the loss function for observed data and J ((f)) imposes some type 
of penalty on the complexity of (f>. Our approach is to replace C n with the 
Gehan (1965) loss function [Jin et al. (2003)] and model <f) using penalized 
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regression splines; our focus is to build risk prediction scores. To minimize 
the penalized loss function (2), the insight into the optimization procedure 
is due, in part, to Koenker, Ng and Portnoy (1994), who noted that the op- 
timization problem in quantile smoothing splines can be solved by Li-type 
linear programming techniques and proposed an interior point algorithm for 
the problem. Li, Liu and Zhu (2007) built on this idea to propose an entirely 
different path-finding algorithm for more general nonparametric quantile re- 
gression models. Along similar lines, when </(</>) is taken as a L\ norm as 
in penalized regression splines [Ruppert and Carroll (1997)], the optimiza- 
tion problem of (2) is essentially an L\ loss plus L\ penalty problem, and 
can also be solved by Li-type linear programming techniques, which will 
be exploited in our approach to the optimization problem. Once the basic 
spline framework is adopted, we show that our estimator can be generalized 
through additive models for q > 1 and variable selection in the linear compo- 
nent. The additive structure of nonlinear components [Hastie and Tibshirani 
(1990)] is adopted to further alleviate the issue of curse of dimensionality. 
To the best of our knowledge, there is no similar work in the partly lin- 
ear or partly additive model for censored or uncensored data using Cox or 
AFT models, and we are the first to conduct systematic investigation on the 
impact of misspecified nonlinear effects on prediction and feature selection 
using AFT models for high-dimensional data. 

More recently, Chen, Shen and Ying (2005) proposed stratified rank esti- 
mation for Model (1) and Johnson (2009) proposed a regularized extension. 
However, their stratified methods are fundamentally different from ours in 
several aspects. First and foremost, the stratified estimators do not provide 
an estimate of the nonlinear effect of the stratifying variable, namely, <j)(X), 
and, hence, the lasso extension proposed by Johnson (2009) focused on vari- 
able selection only It is evident that 4>{X) plays an important role in predic- 

tion; since the stratified estimators in Johnson (2009) can only use i? Z for 
prediction, their performance suffers, which will be shown in our numerical 
studies. By contrast, our approach provides an estimate of </>(A"), which in 
turn can be used to improve prediction performance. Second, the numerical 
algorithm proposed in Johnson (2009) can only handle the case of d < n and 
their numerical studies are limited to such cases, whereas we here investigate 
the high-dimensional settings with d> n. Third, as will be shown in our nu- 
merical results, our proposed method outperforms the stratified estimators 
in feature selection as well. 

The rest of the article is organized as follows. In Section 2 we present the 
details of the methodology. In Section 3 we investigate the operation charac- 
teristics of the proposed approach through simulation studies. In Section 4 
we analyze the prostate cancer study and provide answers to the research 
questions of interest. We conclude this article with some discussion remarks 
in Section 5. 



G 



LONG, CHUNG, MORENO AND JOHNSON 



2. Methodology. 

2.1. Regression splines in partly linear AFT model. We first consider 
a simplified case for the partly linear AFT model (1), where Xj is assumed 
to be univariate, that is, q = 1 and Xj = Xi, and then Model (1) reduces to 

(3) T i = 4 ) {X i )+ti T Z i + e i . 

Let B(x) = {Bi(x), . . . ,Bm(x)} t (M < n) be a set of basis functions. We 
use a regression spline model for </>(•), which asserts that 4>{x) = B(x) T /3, for 
some (3 € $l M . Popular bases include i?-splines, natural splines and trun- 
cated power series basis [Ruppert, Wand and Carroll (2003)]. As explained 
in Section 2.2, we will use the truncated power series basis of degree p with- 
out the intercept term, that is, B(x) = {x, . . . , x p , (x — ki)\, ■ ■ ■ , (x — K r )+} T , 
where (ki,...,«v) denotes a set of r knots, and (u) + = ul(u > 0). Hence, 
M = p + r. Throughout, we use equally spaced percentiles as knots and set 
p = 3, that is, the cubic splines, unless otherwise noted. Let 6 = ((3, •&) denote 
the parameters of interest. Then, define 6rs = (/3,$) = argmhig ^ C n ((3, 
where 

n n 

(4) £ n 03,tf)=n- 2 ^^( ei - ei )_ 

i=l j=l 

with a = Ti — /3 T B(Xj) — # T Zj and c_ = max(0, — c). Because Model (3) has 
been "linearized," we can apply existing rank-based estimation techniques 
for the usual linear AFT models. In particular, Jin et al. (2003) noted that 
the minimizer of >C n (/3, , «?) is also the minimizer of 



i=l j=\ 



C-(/3 T ,^ T )^^,5 fc A fc 

k=l 1=1 

for a large constant C, where D lk = {M(Xi) T ,Zj} T - {M(X k ) T ,Zj} T . Evi- 
dently, the minimizer of this new loss function may be viewed as the solution 
to a L\ regression of a pseudo response vector V = (Vi, . . . , Vs) T (S x 1) on 
a pseudo design matrix W = (Wi, . . . , Wj) T (S x (M + d)). It can be read- 
ily shown that V is of the form {<5j(Tj — Tj),... ,C} T an d W is of the form 
{SiDij, YZ=i Ya=i $kDik) T , where ^(T- - 7)) and SiDjj go through all i 
and j with <5j = 1, and, hence, S denotes the number of pseudo observations 
in V. Consequently, we have 

s 

(5) RS = argmin V | V s - T W S | . 

The fact that 0rs can be written as the L\ regression estimate facilitates 
the numerical techniques, which will be used for our subsequent estimators. 
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2.2. Penalized regression splines in partly linear AFT models. When re- 
gression splines are used to model nonlinear covariates effects, it is crucial 
to choose the optimal number and location of knots . . . , K r ). It is well 
known that too many knots may lead to overfitting, whereas too few may not 
be sufficient to capture nonlinear effects [Ruppert, Wand and Carroll (2003)]. 
The penalized regression spline regression approach [Eilers and Marx (1996); 
Ruppert and Carroll (1997); Li and Ruppert (2008); Claeskens, Krivobokova 
and Opsomer (2009)] handles this problem by starting with a very large 
number of knots and applying regularization to avoid overfitting. In addi- 
tion, a penalized regression spline with L\ penalty corresponds to a Bayesian 
model with double exponential or Laplace priors and is known to be able to 
accommodate large jumps when using the truncated polynomial basis func- 
tions [Ruppert and Carroll (1997)]. While the truncated power series basis 
is often used for penalized regression spline [Ruppert and Carroll (1997)], 
one can use other bases such as B-splines basis in penalized regression spline 
models and the results should not differ as long as two sets of bases span the 
same space of functions [Li and Ruppert (2008)]. We adopt the L\ penalty 
and consider the penalized regression spline estimator 



(6) 0prs(7) 



f M 1 

argmin^ £„(/3,#) +7 Y] \Pm\>, 



referred to as the partly linear AFT estimator, where 7 is a regulariza- 
tion parameter and is used to achieve the goal of knot selection. Using 
the L\ loss function in (5) and a data augmentation technique for regular- 
ized L\ regression, #prs(7) may be found easily for a given 7. Namely, define 
V* = (V T ,0j) T , W* = [W T ,(0 rxp ,D r ,0 rxd ) T ] T , and D r = 7 / r , where r 
is a r-vector of zeros, rxp (O r xd) is a r x p (r x d) matrix of zeros and I r 
an r-dimensional identity matrix. Then, #prs(7) is found through the L\ 
regression of V* on W*. 7 can be selected through cross-validation or gen- 
eralized cross-validation [Ruppert, Wand and Carroll (2003)]. 

2.3. Variable selection and prediction in partly linear AFT models. Fi- 
nally, we consider variable selection for the high-dimensional features (Z) 
in the partly linear AFT model (3) by extending the penalized regression 
spline estimator #prs(7). Let A be another regularization parameter and 
consider the minimizer to the L\ regularized loss function 



(7) 0pRS(i)(7,A) = ar 



{M d \ 

£„(/3,tf)+ 7 l&»l + A X>il f> 

m=p+l j=l ) 



which is also referred to as the lasso partly linear AFT model estimator. 
The data augmentation scheme used in Section 2.2 applies to the regularized 
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estimator in (7) as well. Define the pseudo response vector V' = (V T , 0j +d ) T 
and the pseudo design matrix 



For fixed 7 and A, the estimate is computed as the L\ regression estimate 
of Vt on W'. To select 7 and A, we can use two approaches, namely, the 
cross-validation (CV) and the generalized cross-validation (GCV) [Tibshi- 
rani (1997); Cai, Huang and Tian (2009)]. The X-fold CV approach chooses 
the values of 7 and A that maximize the Gehan loss function (4). The 
GCV approach chooses the values of 7 and A that maximize the criteria, 
£ n (/3,i?)/(l — cL,\/n) 2 , where n is the number of observations and d yj \ is 
the number of nonzero estimated coefficients for the basis functions (B(X)) 
and linear predictors (Z), that is, the number of nonzero estimates in (/3, •&). 
Note that <i 7i A depends on 7 and A. Once #prs(i) i s obtained, one can build 

prediction scores as <fi(X) + $ Z. 

2.4. Extension to additive partly linear AFT models. When Xj is of q- 
dimension (q > 1) in the partly linear model (1), estimation is more difficult 
due to the issue of curse of dimensionality, even when q is moderately large 
and in the absence of censoring. For our partly linear AFT model, we propose 
to use an additive structure for <f> to further alleviate the problem, namely, 
an additive partly linear AFT model, 



where </>j's (j = 1, . . . , q) are unknown functions. Similar to what is discussed 
in Section 2.2, penalized regression splines can be used for the additive partly 
linear model to conduct knot selection for each nonlinear effect, ■ ) 

(j = 1, . . . , q). The variable selection for Z as discussed in Section 2.3 can also 
be extended to this additive partly linear AFT model. When q is large and 
it is also of interest to conduct feature selection among q additive nonlinear 
effects, one can modify the regularization term for (3 in the loss functions (6) 
and (7); specifically, one can regularize all (3, that is, 7^m=i \Pm\, as op- 
posed to regularizing only the terms that correspond to the set of jumps in 
the pth. derivative, that is, 7 Yl m =p+i I ■ Similarly, we can modify the data 
augmentation scheme to obtain the parameter estimates for these models. 




(8) 




2.5. Numerical implementation for high- dimensional data. In Sec- 
tions 2.1-2.4 the parameters are estimated using L\ regression models through 
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a data augmentation scheme such as (5), which can be readily implemented 
using the quant r eg package in R. While this algorithm works well when 
the total number of parameters is small relative to the sample size, it be- 
comes very slow and starts to fail as the number of parameters gets close 
to or greater than the effective sample size after accounting for censoring. 
As an alternative, we extended a numerical algorithm developed for effi- 
cient computation of rank estimates for AFT models [Conrad and Johnson 
(2010)] to compute the proposed estimators, in particular, the estimator 
in (7). In essence, this method approximates a L\ regularized loss function 
with a smooth function and subsequently optimizes the smoothed objective 
function using a Limited-Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) al- 
gorithm [Nocedal and Wright (2006)], which is implemented in Matlab. This 
method speeds up the computation substantially and can handle the case 
of high-dimensional data. We have compared these two algorithms and they 
give very similar results when both are applicable, that is, Z is of low di- 
mension. 

3. Simulation studies. We conducted extensive simulation studies to eval- 
uate the operating characteristics of the proposed models including estima- 
tion, feature selection and, most importantly, prediction, in comparison with 
several existing models. 

3.1. Estimation. We considered a case of single Zi and single Aj, that 
is, Model (3), and focused on the estimation of the regression coefficient $ 
and its sampling variance. In this setup, no feature selection is involved. 
To facilitate comparisons, our simulation study details were adapted from 
those given by Chen, Shen and Ying (2005) and Johnson (2009). The ran- 
dom variable Z{ was generated from a standard normal distribution, and Aj 
was generated through Aj = 0.25Zj + Ui, where \J% follows a uniform distri- 
bution Un(— 5, 5) and completely independent of all other random variables. 
In Model (3) we let ■& = 1 and Si ~ A(0, 1) and mutually independent of 
(Xi,Zi). We considered linear and quadratic effects, that is, </>(Aj) = 2Aj 
and 0(A"j) = Xf, respectively. Finally, censoring random variables were sim- 
ulated through Cj = 4>{Xi)+Zi-&+U*, where U* follows Un(0, 1). As a result, 
the proportion of censored outcomes ranges from 20% to 30%. We compared 
several estimators, the partly linear AFT model (PL-AFT) with r knots 
(r = 2 and 4), which was fit using the loss function (6), the stratified estima- 
tor in Chen, Shen and Ying (2005) (S^-AFT) where K denotes the number 
of strata, the standard linear AFT model with both Aj and Zi modeled 
linearly (AFT), and an AFT model with true <j) plugged in (AFT-<^>). Two 
sample sizes were used, n = 50 and n = 100. 

Our simulation results show that the CV and GCV methods give similar 
results, so we report only the results using GCV. Table 1 summarizes the 



10 LONG, CHUNG, MORENO AND JOHNSON 

Table 1 

Simulation results for parameter estimation (d) based on 200 Monte Carlo data sets, 

where i? = 1 



cb(X) = 2X 4>(X) = 2X 2 





Bias 


SD 


MSE 


Bias 


SD 


MSE 










n = 50 






PL- AFT (r = 2) 


-12 


159 


25 


-2 


166 


28 


PL- AFT (r = 4) 


-10 


159 


25 


-1 


168 


28 


S 5 -AFT 


95 


288 


92 


-65 


436 


195 


Sio-AFT 


28 


223 


50 


-43 


299 


91 


S 25 -AFT 


31 


303 


93 


-38 


381 


146 


AFT 


-4 


153 


23 


21 


1,214 


1,475 


AFT-0 


-7 


154 


24 


-5 
n = 100 


158 


25 


PL- AFT (r = 2) 


-9 


113 


13 


-2 


115 


13 


PL- AFT (r = 4) 


-9 


113 


13 


-1 


115 


13 


Sio-AFT 


44 


163 


29 


-23 


210 


45 


S25-AFT 


1 


157 


25 


-9 


185 


34 


Sso-AFT 


-7 


193 


37 


8 


209 


44 


AFT 


-8 


113 


13 


71 


755 


575 


AFT-0 


-9 


113 


13 


-2 


111 


12 


Range of SEs 


8-21 


NA 


1-12 


8-86 


NA 


1-209 



PL-AFT, partly linear AFT model with r knots; Sk-AFT, stratified AFT estimator with 
K strata; AFT, standard linear AFT model with both Xi and Zi modeled linearly; and 
AFT-<fi, AFT model with true <j) plugged in. Range of SEs, the range of SEs for the 
corresponding performance measure in each column. NA, SE of a performance measure 
cannot be computed for SD. All numbers are multiplied by 1,000. 



mean bias, standard deviation (SD) and mean squared error (MSE) of •& 
over 200 Monte Carlo data sets, and it also provides the range of standard 
errors for the performance measure in each column, where all numbers are 
multiplied by 1,000. In all cases, the proposed partly linear AFT estimator 
outperforms the stratified estimators as well as the standard AFT estimator 
in terms of MSE, and its performance is comparable to that of the estimator 
using the true <j). The number of knots has little impact on the performance 
of our proposed estimator. The standard linear AFT estimator exhibits the 
largest bias and MSE when cfi is not linear, indicating that it is important 
to adjust for the nonlinear effect of X even when one is only interested in 
the effect of Z. While the stratification step in the S^-AFT method results 
in reduced bias when the number of strata is large, it has larger SD and 
MSE compared to PL-AFT. Furthermore, in the settings of our interest, no 
method has been proposed for choosing K in the S^-AFT method, which 
is not obvious either, leading to a further shortcoming of this method over 
the others. 
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3.2. Feature selection. In our second set of simulation studies, we fo- 
cused on simultaneous estimation and feature selection for Zj as well as 
prediction. The regression function still consisted of a nonlinear effect of 
a single covariate Xi, but we increased the dimension of the linear pre- 
dictors (Zj) to d = 8. Zj were generated from a multivariate normal with 
a mean equal to 0^ and (j, /c)th element of the covariance matrix equal 
to (p = 0,0.5,0.9). The covariate Xi was generated through Xi = 

Q.hZu + 0.5Z2j + 0.5Zzi + Ui, where Ui is Un(— 1,1) and independent of 
all other random variables. This corresponds to a case where Z\ and Z<i 
have both direct and indirect effects through X on the outcome, whereas Z% 
has only an indirect effect on the outcome. The true regression coefficients 
for Z are set to = (A, A, 0, 0, 0, A, 0, 0)', where A = 1 and 0.5 repre- 
sent a strong signal (effect size) and a weak signal (effect size), respec- 
tively. In this case, the three important covariates (namely, Z\, Z2 and Z%) 
can potentially be highly correlated. The effect of Xi was generated from 
(f>(Xi) = (0.2 * Xi + 0.5 * X? + 0.15 * Xf)I(Xi > 0) + (0.05 * Xj)J(Xj < 0), 
where /(•) is the indicator function. This setup mimics a practical setting 
where the effect of the clinical variable (X) on the outcome is ignorable 
when X is less than a threshold level (X = 0); but as X increases past the 
threshold level, its effect becomes appreciable. The log survival time Tj was 
then generated using equation (3), where £j follows iV(0, 1) and is mutually 
independent of (Aj,Zj). The censoring random variable was simulated ac- 
cording to the rule, C, = 4>{Xi) + # r Zj + U* , where U* follows the uniform 
distribution Un(0, 6). The resulting proportion of censoring ranges from 20% 
to 30%. 

We compared six models: (1) the lasso partly linear AFT model (Lasso- 
PL) with r = 6 which was fit using the loss function (7); (2) the lasso strat- 
ified model (Lasso-S^-) [Johnson (2009)] where K denotes the number of 
strata; (3) the lasso linear AFT model assuming a linear effect for both Xi 
and Zj (Lasso-L); (4) the standard linear AFT model (AFT); (5) the lasso 
linear Cox PH model assuming a linear effect for both Xi and Zj (Lasso- 
Cox) [Tibshirani (1997); Goeman (2010)]; and (6) the so-called oracle partly 
linear model (Oracle) with $3, $4, $5, $7 and #8 fixed at and r = 6 for 
the penalized splines. We are not aware of any existing Cox PH model that 
can handle both nonlinear covariate effects and feature selection in high- 
dimensional data. Since the data were generated under a true AFT model 
and the PH assumption underlying the Cox model is violated, we are pri- 
marily interested in feature selection when comparing the Lasso-Cox model. 
The oracle model, while unavailable in practice, may serve as an optimal 
benchmark for the purpose of comparisons. In each instance of regularized 
methods, GCV was used to tune the regularization parameters, A and/or 7. 

In each simulation run, a training sample of size n = 125 and a testing 
sample of size lOn were generated. To evaluate parameter estimation, we 
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monitored the sum of squared errors (SSE) for # denned as (i9 — 19) (# — #). 
To evaluate feature selection, we monitor the proportion of zero coeffi- 
cients being set to zero (P c = £- = i = 0)7(0* = 0)/ £ti = °))> for 
which 1 is the optimal value, and the proportion of nonzero coefficients being 
set to zero (Pj = £? =1 = 0)J(^ ^ 0)/Y*=l 7(tfi ^ 0)), for which is the 
optimal value. To assess the prediction performance, we considered two mean 
squared prediction errors, MSPEi = (lOn)" 1 Ylf=i[4>( x j) ~ <t>( x j) + (& - 

tf) T Zj} 2 , and MSPE 2 = (lOn)" 1 J2]°=i0 ~ #) Tz jf, where 3 goes through 
the observations in the testing sample. MSPEi is the squared prediction 
error using both nonlinear and linear components in Model (3), and MSPE2 
is the squared prediction error using only linear components in Model (3). 
For AFT models, MSPEi and MSPE2 can be considered as metrics of pre- 
diction performance on the log-transformed scale. Note that the stratified 
Lasso model does not provide an estimate of 4>(X), so MSPEi is not appli- 
cable for Lasso-S^. For each simulation setting, the performance measures 
were averaged over 400 Monte Carlo data sets. For the performance measure 
in each column, the range of standard errors was computed. 

Our simulation results are summarized in Table 2. First, the performance 
of the standard linear AFT model (AFT) is not satisfactory in terms of both 
prediction and feature selections. We now restrict the discussion to the regu- 
larized estimators. In all cases, our Lasso-PL estimator exhibits lowest SSE, 
MSPEi and MSPE2 among regularized estimators; in particular, its MSPEi 
and MSPE2 are comparable to that of the Oracle estimator and are substan- 
tially lower than other regularized estimators. In terms of feature selection, 
Lasso-PL, Lasso-L and Lasso-Cox correctly identify the majority of the re- 
gression coefficients that are zero (Pc)', Lasso-PL has higher Pc than Lasso-L 
when p = or 0.5 and their Pc's are comparable in the presence of high cor- 
relation (p = 0.9); and Lasso-L has considerably higher Pc than Lasso-Cox 
in all cases. By comparison, the lasso stratified models (Lasso-S^) only iden- 
tify less than 30% of true zeros in some cases and roughly half of the true 
zeros in the rest of the cases. When there is no correlation and the signal 
is strong, all Lasso estimators successfully avoid setting nonzero coefficients 
to zero, that is, Pi equal to or close to 0. However, as the correlation gets 
stronger, P\ increases for all estimators to various degrees. When p = 0.9, Pi 
becomes appreciable for Lasso-L, whereas it remains moderate for Lasso-PL. 

3.3. Prediction in the presence of high- dimensional data. We conducted 
a third set of simulations to explore the impact of noise levels on the predic- 
tion performance in the presence of high-dimensional data (i.e., d>n), and 
compared four models, namely, Lasso-PL, Lasso-S^-, Lasso-L and Lasso- 
Cox. We note that the standard AFT model is not applicable for high- 
dimensional data. The simulation setup paralleled that in Section 3.2. The 
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Table 2 

Simulation results for evaluating feature selection and prediction performance based on 
400 Monte Carlo data sets, where n = 125 and d = 8 









A : 


= 1 








A = 


0.5 






SSE 


Pc 


Pi 


MSPEi 


MSPE 2 


SSE 


Pc 


Pi 


MSPEi 


MSPE; 


Lasso-PL 


8 


734 





244 


P 

67 


= 

8 


724 





237 


67 


Lasso-S2 


2-3 


482 





NA 


186 


23 


453 


1 


NA 


185 


Lasso-S4 


16 


582 





NA 


127 


15 


565 


2 


NA 


122 


Lasso-Ss 


20 


424 





NA 


161 


20 


438 


8 


NA 


159 


Lasso-L 


12 


639 





997 


100 


12 


611 





990 


99 


Lasso-Cox 


NA 


488 





NA 


NA 


NA 


543 


17 


NA 


NA 


AFT 


18 








982 


142 


18 








982 


143 


Oracle 


4 


1,000 





153 


20 

P = 


4 

= 0.5 


1,000 





207 


30 


Lasso-PL 


11 


767 





225 


74 


11 


777 


2 


296 


75 


Lasso-S2 


38 


403 





NA 


341 


40 


412 


8 


NA 


353 


Lasso-S4 


21 


569 





NA 


171 


20 


599 


5 


NA 


146 


Lasso-Ss 


26 


540 





NA 


218 


26 


594 


15 


NA 


204 


Lasso-L 


19 


720 





2,894 


126 


19 


748 


16 


2,943 


121 


Lasso-Cox 


NA 


562 





NA 


NA 


NA 


612 


14 


NA 


NA 


AFT 


33 








2,839 


212 


32 








2,878 


202 


Oracle 


5 


1,000 





175 


31 

P = 


5 

= 0.9 


1,000 





248 


32 


Lasso-PL 


45 


739 


2 


373 


118 


39 


758 


113 


337 


130 


Lasso-S2 


126 


502 


16 


NA 


592 


106 


500 


152 


NA 


595 


Lasso-S4 


77 


582 


4 


NA 


184 


60 


596 


124 


NA 


170 


Lasso-Ss 


118 


236 


6 


NA 


338 


96 


424 


135 


NA 


390 


Lasso-L 


92 


751 


31 


6,571 


245 


65 


778 


270 


6,738 


262 


Lasso-Cox 


NA 


596 


8 


NA 


NA 


NA 


651 


153 


NA 


NA 


AFT 


224 








6,483 


337 


226 








6,612 


354 


Oracle 


17 


1,000 





320 


55 


17 


1,000 





288 


54 


Range of SEs 


0.1-8 


0-24 


0-5 


8-76 


1-23 


0.2-8 


0-26 


0-13 


10-81 


1-25 



Lasso-PL, Lasso partly linear AFT model; Lasso-Sx, Lasso stratified model with K strata; 
Lasso-L, Lasso linear AFT model assuming a linear effect for both Xi and Z;; Lasso-Cox, 
Lasso linear Cox model assuming a linear effect for both Xi and Z;; AFT, standard AFT 
model assuming linear effects for both Xi and Z; without regularization; and Oracle, oracle 
partly linear model with zero coefficients being set to 0. A, effect size; SSE, sum of squared 
errors for 1?; Pc, proportion of zero coefficients being set to zero; Pi, proportion of nonzero 
coefficients being set to zero; MSPEi, squared prediction error using both nonlinear and 
linear components; and MSPE2, squared prediction error using only linear components. 
Range of SEs, range of SEs for the corresponding performance measure in each column. 
NA, a performance measure is not applicable for an estimator. All numbers are multiplied 
by 1,000. 
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differences are noted as follows. The sample size was fixed to n = 100 and the 
number of linear predictors was d > 100, and let #i = #26 = #51 = $76 = 1 
and all other $'s be 0. Let X = O.5Z10 + O.5Z35 + 0.5^60 + Ui, where Ui follows 
Un(— 1, 1). Through these changes, we investigated a case where the signifi- 
cant linear predictors (Z) are not highly correlated. The censoring random 
variable was generated similar to that in Section 3.2 with a different uni- 
form distribution such that the censoring probability is approximately 40%. 
Since MSPEi and MSPE2 are not applicable in the presence of censoring 
in practice, we computed another metric of prediction performance using 
the testing sample, namely, the c statistic for censored data, which mea- 
sures the proportion of concordance pairs based on observed and predicted 
outcomes and ranges between and 1 with 1 indicating perfect prediction 
[Kattan (2003a); Kattan (2003b); Steyerberg et al. (2010)]. In particular, the 
comparison with Lasso-Cox is focused on c statistics. Again, for Lasso-S^-, 

MSPEi was not applicable and i9 Zj was used to compute the c statistic; 
for the performance measure in each column, the range of standard errors 
was computed. 

Table 3 summarizes the prediction performance for d = 100, d = 500 
and d = 1,500 over 400 Monte Carlo data sets. In the presence of high- 
dimensional data, Table 3 shows that the proposed Lasso-PL always achieves 
the best prediction performance in terms of the c statistic as well as MSPEi 
and MSPE2, and Lasso-Cox always has lower c than Lasso-PL and Lasso-L. 
By and large, the prediction performance of Lasso-S^- is comparable to that 
of Lasso-L and is considerably worse than Lasso-PL in all cases, and, in par- 
ticular, the absence of the estimated nonlinear effect in X leads to substantial 
loss in the c statistic. While Lasso-PL estimates the nonlinear effect of X 
well in all cases, the prediction error due to the linear predictors (MSPE2) 
starts to dominate as d increases. Since all significant predictors are in the 
first 100 predictors, the cases of d = 1,500 and d = 500 simply add 1,100 
and 400 noise predictors, respectively, compared to the case of d = 100. Our 
results indicate that as the noise level increases the prediction performance 
deteriorates for all models. For Lasso-L models, the prediction error due to 
misspecified nonlinear effect of X remains substantial in all cases. In this 
setup, when correlation is weak or moderate (p = or 0.5), the impact of 
correlation on prediction performance is moderate, in particular, in terms 
of c; however, as correlation becomes very strong (p = 0.9), the prediction 
performance improves considerably in terms of c for all methods. 

We performed additional simulations for a higher censoring rate, 60%, and 
for different regression coefficient values, for example, #1 = #2 = #3 = #50 = 1 
and all other i?'s set to 0, that is, the first three significant predictors are 
highly correlated. Under all scenarios, the results on comparisons between 
different models remain the same, but the prediction performance worsens 
as the censoring rate increases. 
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Table 3 

Simulation results for evaluating prediction performance in the presence of 
high- dimensional data based on 400 Monte Carlo data sets, where n = 100 



<f=100 d = 500 d= 1,500 

MSPEi MSPE 2 c MSPEi MSPE 2 c MSPEi MSPE 2 c 



p = 



Lasso-PL 


412 


349 


860 


989 


897 


840 


1,685 


1,543 


796 


Lasso-S2 


NA 


676 


811 


NA 


1,589 


768 


NA 


2,310 


711 


Lasso- S4 


NA 


560 


812 


NA 


1,428 


780 


NA 


2,182 


718 


Lasso-Ss 


NA 


529 


811 


NA 


1,454 


775 


NA 


2,208 


716 


Lasso-L 


1,441 


568 


829 


2,752 


1,666 


784 


3,719 


2,496 


697 


Lasso-Cox 


NA 


NA 


798 


NA 


NA 


749 


NA 


NA 


684 


Lasso-PL 


389 


330 


860 


P 

1,034 


= 0.5 
937 


839 


1,659 


1,518 


797 


Lasso-S2 


NA 


637 


810 


NA 


1,653 


766 


NA 


2,270 


716 


Lasso- S4 


NA 


525 


812 


NA 


1,472 


777 


NA 


2,152 


725 


Lasso-Ss 


NA 


491 


811 


NA 


1,512 


774 


NA 


2,196 


721 


Lasso-L 


1,418 


550 


829 


2,803 


1,720 


781 


3,703 


2,513 


701 


Lasso-Cox 


NA 


NA 


799 


NA 


NA 


749 


NA 


NA 


690 


Lasso-PL 


387 


328 


875 


P 

1,084 


= 0.9 
1,124 


852 


1,795 


1,909 


811 


Lasso-S2 


NA 


529 


841 


NA 


1,314 


815 


NA 


2,059 


769 


Lasso-S4 


NA 


474 


842 


NA 


1,422 


812 


NA 


2,253 


759 


Lasso-Ss 


NA 


455 


841 


NA 


1,618 


805 


NA 


2,473 


744 


Lasso-L 


1,476 


480 


852 


2,274 


1,152 


836 


3,179 


1,849 


802 


Lasso-Cox 


NA 


NA 


840 


NA 


NA 


825 


NA 


NA 


796 


Range of SEs 


9-20 


8-23 


0.6-2 


32-56 


32-52 


1-4 


47-61 


47-57 


2-5 



Lasso-PL, Lasso partly linear AFT model; Lasso-Sx, Lasso stratified model with K strata; 
Lasso-L, Lasso linear AFT model assuming a linear effect for both Xi and Z;; and Lasso- 
Cox, Lasso linear Cox model assuming a linear effect for both Xi and Z;. MSPEi, the 
squared prediction error using both nonlinear and linear components; MSPE2 , the squared 
prediction error using only linear components; and c, the c-statistic for censored data. 
Range of SEs, range of SEs for the corresponding performance measure in each column. 
NA, a performance measure is not applicable for a estimator. All numbers are multiplied 
by 1,000. 

In summary, the proposed lasso partly linear AFT model achieves best 
performance in all three areas: estimation, feature selection and prediction. 
While the lasso stratified estimator performs reasonably well in estimation, 
its performance in feature selection and prediction is not satisfactory. When 
a covariate effect is nonlinear, the performance of Lasso-L worsens, and 
the deterioration can be substantial in terms of prediction. When the PH 
assumption does not hold, the performance of Lasso-Cox is considerably 
worse than Lasso-L. Furthermore, if prediction is of primary interest, our 
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results suggest that it is advantageous to build prediction scores using data 
with less noise variables. 

4. Data analysis: The prostate cancer study. We analyzed the data from 
the prostate cancer study, which included 78 patients. The outcome of inter- 
est is time to prostate cancer recurrence, which starts on the day of prosta- 
tectomy and is subject to censoring; the observed survival time ranges from 
2 months to 160 months and the censoring rate is 57.7%. In the data anal- 
ysis, the log-transformed survival time was used to fit AFT models. Gene 
expression data using 1,536 probes and two clinical variables (PSA and glea- 
son score) were measured from samples collected at the baseline (i.e., right 
after the surgery) and were used in our analysis. Since replicate RNA sam- 
ples were collected and measured from some subjects, we averaged the gene 
expression data over multiple RNA samples from a same subject before sub- 
sequent analysis. The gleason score in this data set ranges only between 5 
and 9 and 91% of patients had a score of either 6 or 7; combining this with 
suggestions from the investigators, the total gleason score was dichotomized 
as > 7 or not. 

Before the data analysis, all gene expression measurements were prepro- 
cessed and standardized to have mean and unit standard deviation. Sub- 
sequently, Cox PH models were fit for each individual probe and all probes 
were then ranked according to their score test statistics from the largest 
(J = 1) to the smallest (J = 1,536). This ranking procedure serves two pur- 
poses. First, it simplifies the presentation of the results, since we can refer to 
each probe using its ranking. Second, a pre-selection step using this ranking 
procedure is used when evaluating the prediction performance in Section 4.2, 
which is similar to what is often used in detecting differentially expressed 
genes. We note that the use of Cox PH models is of no particular impor- 
tance, which simply provides a way to rank the probes; one can use other 
models such as AFT models. 

4.1. Feature selection. Before building prediction scores, we conducted 
feature selection using the following models: the Lasso-PL with r = 10, 
Lasso-S^-, Lasso-L and Lasso-Cox. In the Lasso-PL model (3), Xi is PSA, 
which is modeled using penalized splines, and Z includes the binary clini- 
cal variable, gleason score, as well as the complete set or a subset of 1,536 
probes. Similarly, in the Lasso-S^- model, stratification is based on PSA. 

We first conducted an analysis using the complete set of 1,536 probes. 
The results on feature selection are summarized in Table 4. A linear effect 
of PSA was included in the Lasso-L model and was estimated to be nonzero, 
which further justifies the inclusion of PSA in other models; on the other 
hand, the total gleason score is not selected by any of the methods. Figure 1 
shows the estimated effect of PSA using Lasso-PL; specifically, the time to 
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Table 4 

Feature selection for the prostate cancer study 



Method 


Selected probes 


Lasso-PL 


1,2,4,12,16,31,38,46,63 


Lasso-S2 


1, 4, 8, 12, 16, 31, 46, 63, 382, 906 


Lasso- S4 


1, 4, 12, 16, 29, 31, 36, 38, 46, 56, 70, 78, 310, 382, 390, 591, 1,500 


Lasso-Ss 


1, 4, 8, 9, 16, 18, 31, 36, 37, 38, 46, 56, 57, 70, 78, 178, 237, 271, 310, 855, 1,500 


Lasso-L 


1,2,4,8,9,16,31,46,63,70,136 


Lasso-Cox 


2,4,8,11,14,16,22,31,46,52,63 



recurrence initially decreases as PSA increases and then starts to increase 
slightly as PSA goes beyond 11. After further examination of the data, we 
found that most patients had PSA values ranging from 0-15.2, but three 
had PSA values of 18.43, 26 and 32.10. More importantly, all subjects with 
PSA > 15.2 had censored outcomes; consequently, it is not appropriate to 
project the estimated 4>{X) beyond 15.2. We also suspect that the increasing 
trend toward the right tail is an artifact of the data and the effect of PSA 
instead levels off when it is greater than 11, given that an increase in the 
time to recurrence as PSA increases does not seem plausible clinically. 

In terms of feature selection for the probe data, the Lasso-PL model 
selects the least number of features, among which Probe 4, 16, 31 and 46 
are selected by all six models, Probe 1 selected by five models, Probe 63 
selected by four models and Probe 2, 12 and 38 selected by three models. 
In other words, all probes selected by Lasso-PL are selected by at least half 
of all models, whereas other models select some probes that are not shared 
by the rest of the models and are likely to be noise. This agrees with the 



id 




n 1 1 [— 

5 10 15 

PSA 



Fig. 1. Estimated nonlinear effect of PSA on the prostate cancer recurrence after 
surgery (4>(X) ). 
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simulation results, that is, in the presence of moderate to strong correlation 
among predictors, the other models tend to select a larger number of noise 
features. In addition, the difference between the Lasso-PL method and the 
Lasso-L method is likely due to the nonlinear effect of PSA. 

4.2. Prediction performance. To internally evaluate the prediction per- 
formance, the data were randomly split into a training sample (60%) and 
a validation sample (40%). Due to the high censoring rate, this step was 
stratified on the censoring status to avoid extreme imbalance of censoring 
rates between the training and validation samples. The models of interest 
were fit using the training sample and were then used to construct the pre- 

dictive risk score for cancer recurrence, say, 4>(X) + # Z for Lasso-PL, for 
subjects in the validation sample. Subsequently, the c statistic was com- 
puted in the validation sample. This procedure was repeated 1,000 times 
and the average c statistic is used for evaluating the prediction performance 
of different models. 

We compared the following model and data combinations: Lasso-PL with 
r = 10 using 1,536 probes and 2 clinical variables with PSA modeled non- 
linear ly; Lasso-L and Lasso-Cox using 1,536 probes and 2 clinical variables; 
Lasso-PL with r = 10 using 2 clinical variables plus top 25 probes with 
PSA modeled nonlinearly, where the top 25 probes were selected within 
each training sample; Lasso-L and Lasso-Cox using 2 clinical variables plus 
top 25 probes; partly linear AFT and Cox models (PL-AFT and PL-Cox) 
using 2 clinical variables only with PSA modeled nonlinearly through a pe- 
nalized spline; linear AFT and Cox model (AFT and Cox) using 2 clinical 
variables only. Note that we did not use Lasso-Sj<-, since it does not estimate 
the nonlinear effect of PSA. 

Table 5 presents the mean c statistic computed using each model and 
data combination. Partly linear models have higher average c than linear 
models in all settings and for both AFT and Cox models, indicating that 
the misspecified effect of PSA leads to worse prediction performance. In 
all cases, AFT models have similar or higher average c compared to their 
corresponding Cox models. The average c for Lasso-PL using all 1,536 probes 
is slightly less than PL-AFT using only clinical variables, whereas Lasso-L 
and Lasso-Cox using all 1,536 probes have substantially lower c than AFT 
and Cox using only clinical variables. Furthermore, when a pre-selection step 
was included to choose the top 25 probes first, we observe small improvement 
in c for Lasso-L and Lasso-Cox and no improvement for Lasso-PL, which is 
likely due to that the correctly modeled PSA effect plays the most important 
role in prediction and the addition of gene expression data does not seem to 
further improve prediction. 

In summary, our analyses suggest that (1) the relationship between the 
baseline PSA and prostate cancer recurrence is likely nonlinear, that is, 
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Table 5 

Prediction performance in the data analysis: mean c 
statistic 



All 1,536 probes 



Lasso-PL 


Lasso-L 




Lasso-Cox 


0.653 


0.561 




0.553 


Top 25 probes 


Lasso-PL 


Lasso-L 




Lasso-Cox 


0.653 


0.567 




0.572 


Clinical variables only 


PL- AFT 


AFT 


PL-Cox 


Cox 


0.665 


0.644 


0.658 


0.644 



the time to recurrence decreases as PSA increases and it starts to level off 
when PSA becomes greater than 11; (2) the correct specification of this 
nonlinear effect improves performance in prediction and feature selection; 
and (3) the addition of gene expression data does not seem to further improve 
the prediction performance. However, given that the sample size in this study 
is small, our results need to be validated in a future study, preferably with 
a larger sample size. 

5. Discussion. We have investigated statistical approaches for prediction 
of clinical end points that are subject to censoring. Our research shows that 
correctly specifying nonlinear effects improves performance in both predic- 
tion and feature selection for both low-dimensional and high-dimensional 
data. While the proposed models can be used for high-dimensional data, 
caution needs to be exercised in practice, since the sample size is often small 
in real- life studies. This is especially true when prediction is of primary in- 
terest and feature selection is less of a concern. As the regularized methods 
achieve sparsity, they shrink the coefficients of the important predictors. In 
finite samples, such shrinkage becomes more pronounced as the noise level 
(i.e., the number of noise predictors) increases; as a result, the prediction 
performance deteriorates, which is reflected in our simulations and data 
analysis. 

We investigated two numerical methods for fitting proposed models. The 
first algorithm is implemented through a L\ regression, which is slow for 
large data sets or when the number of predictors is large relative to the 
sample size and fails when d> n. These limitations are especially serious for 
censored data. For example, in our data example, the first algorithm started 
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to have convergence issues if d > 25 probes were used, in particular, when 
cross-validation was used or internal validation was performed for evaluating 
prediction performance. The second algorithm as described in Section 2.5 
can deal with high-dimensional data, and its solutions are fairly close to those 
obtained using the first method when both are applicable. Consequently, we 
recommend the use of the second algorithm in practice. 

In this paper we focus on the performance for prediction as well as feature 
selection in finite samples through extensive numerical studies, and the the- 
oretical properties of the proposed methods are likely inherited from those 
of regularized linear AFT models and penalized splines, which are beyond 
the scope of this article and are a topic for future research. Nevertheless, 
our numerical results provide empirical evidence to suggest that the pro- 
posed approach is likely to enjoy the properties on feature selection that are 
possessed by regularized estimation in linear AFT models [Cai, Huang and 
Tian (2009)] and in stratified AFT models [Johnson (2009)]. 

Several metrics have been proposed for assessing the performance of pre- 
diction models, and Steyerberg et al. (2010) provide a nice review on this 
subject; however, it is well known that censoring presents additional chal- 
lenges in developing such metrics [Begg et al. (2000); Gonen and Heller 
(2005); Steyerberg et al. (2010)]. In our simulations and data example, we 
used the extended c statistic to evaluate the prediction performance in the 
presence of censored data; despite its ease of use, this metric uses only con- 
cordant and disconcordant information and hence leads to loss of informa- 
tion. Furthermore, while the existing metrics for censored data are appli- 
cable for AFT models, no metric has been proposed to take advantage of 
the unique feature of AFT models, namely, they model the log-transformed 
outcome and can provide prediction on the log-transformed scale, which is 
not trivial and is another topic for our future research. 
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